From interaction networks to interfaces, scanning intrinsically disordered regions using AlphaFold2

The revolution brought about by AlphaFold2 opens promising perspectives for unraveling the complexity of protein-protein interaction networks. The analysis of interaction networks obtained from proteomics experiments does not systematically provide the delimitations of the interaction regions. This is of particular concern for interactions mediated by intrinsically disordered regions, in which the interaction site is generally small. Using a dataset of protein-peptide complexes involving intrinsically disordered regions that are non-redundant with the structures used in AlphaFold2 training, we show that when using the full sequences of the proteins, AlphaFold2-Multimer achieves only a 40% success rate in identifying the correct site and structure of the interface. By delineating the interaction region into fragments of decreasing size and combining different strategies for integrating evolutionary information, we raise this success rate up to 90%. We obtain similar success rates using a much larger dataset of protein complexes taken from the ELM database. Beyond the correct identification of the interaction site, our study also explores specificity issues. We show the advantages and limitations of using the AlphaFold2 confidence score to discriminate between alternative binding partners, a task that can be particularly challenging in the case of small interaction motifs.


nature portfolio | reporting summary
April 2023

Data analysis
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy

Research involving human participants, their data, or biological material
Policy information about studies with human participants or human data. See also policy information about sex, gender (identity/presentation), and sexual orientation and race, ethnicity and racism.

Reporting on sex and gender
Reporting on race, ethnicity, or other socially relevant groupings

Ethics oversight
Note that full information on the approval of the study protocol must also be provided in the manuscript. There is no sex- and gender-based analysis in our study since these characteristics are not applicable to the biological macromolecules that were analyzed in this study.

Field-specific reporting
There are no race, ethnicity, or other socially relevant groupings in our study since these characteristics are not applicable to the biological macromolecules that were analyzed in this study.

See above
There was no recruitment in the present study.
There was no need for an organization to control these aspects absent from our study.


Life sciences study design
All studies must disclose on these points even when the disclosure is negative. The sample size was constrained by our concern that none of the cases analyzed in the study should be similar to or redundant with any of the structures that were used for the training of AlphaFold2 parameters. Given the conditions that applied at both the sequence and structural levels, the sample size was 42 complexes, each involving a receptor and a small intrinsically disordered protein ligand. We subsequently expanded the test dataset with 923 test cases taken from the ELM database, with the risk that some of the tested cases were used during the training of AlphaFold2 parameters.
The exclusion criteria were predefined prior to any generation and evaluation of structural models.
- A first criterion applied to the release date of the PDB structure to ensure no overlap with the AlphaFold2 training dataset (keep only PDB structures with a release date after May 1st, 2018).
- A second criterion, described in the Methods section, was to exclude any complex with significant sequence similarity to protein assemblies present in the PDB database before May 1st, 2018.
- A third criterion, described in the Methods section, was to exclude any complex with significant structural similarity to protein assemblies present in the PDB database before May 1st, 2018.
For the ELM database, we excluded the complexes not supported by a PubMed ID, those which had no PDB reference reported, either exact or homologous, and those for which multiple ELM motifs of the same ELM type were present on the ligand side.
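As an illustration of the first exclusion criterion, the release-date filter can be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline used in the study: the record layout (`id`, `release_date`), the function name `released_after_cutoff`, and the example entries are all hypothetical, and the sequence and structural similarity criteria are not covered here.

```python
from datetime import date

# Cutoff stated in the text: only structures released after this date are kept.
TRAINING_CUTOFF = date(2018, 5, 1)


def released_after_cutoff(entries, cutoff=TRAINING_CUTOFF):
    """Return the entries whose release date is strictly after the cutoff."""
    return [entry for entry in entries if entry["release_date"] > cutoff]


# Hypothetical records for illustration only (not real PDB entries).
candidates = [
    {"id": "entry_a", "release_date": date(2017, 11, 15)},
    {"id": "entry_b", "release_date": date(2018, 5, 1)},
    {"id": "entry_c", "release_date": date(2020, 3, 4)},
]

kept = released_after_cutoff(candidates)
kept_ids = [entry["id"] for entry in kept]
print(kept_ids)
```

The comparison is strict, so an entry released on the cutoff date itself is excluded, which matches the wording "after May 1st, 2018".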
All the structural models were generated from 5 independent runs of the AlphaFold2 algorithm with 3 recycles, each run generating 5 structural models, following the recommendation of the AlphaFold2 developers. In most cases the success rates for the production of correct models were consistent among the five replicates. Exceptions to that trend are due to the stochasticity of the AlphaFold2 search and are reported in Supplementary Table 3, which lists all the scores and evaluation grades of all the models sampled for each case. A different random seed is used for each of the 5 independent runs of AlphaFold2, and all seeds used are provided in the log of the runs in the archive https://doi.org/10.5281/zenodo.7838024 for the non-redundant dataset of 42 complexes.
Not applicable for the non-redundant dataset of 42 complexes, since it was not divided into subgroups. For the ELM dataset, to correct for the unbalanced distribution of ELM motifs within the 84 categories of ELM types, we evaluated the predicted success rates by repeated stratified sampling, with 1000 repeats of randomly selecting one ELM motif from each of the 84 ELM type categories.
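The repeated stratified sampling described above can be sketched as follows. This is a minimal sketch, not the authors' script: the function name `stratified_success_rate`, the `(elm_type, success)` input layout, the fixed seed, and the toy data are assumptions made for illustration. Each repeat draws one case per ELM type, so over-represented types do not dominate the estimated success rate.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def stratified_success_rate(cases, n_repeats=1000, seed=0):
    """Estimate a success rate by drawing one case per category, repeatedly.

    cases: iterable of (elm_type, success) pairs, where success is a bool.
    Returns the mean and standard deviation of the per-repeat success rates.
    """
    rng = random.Random(seed)

    # Group the outcomes by ELM type category.
    outcomes_by_type = defaultdict(list)
    for elm_type, success in cases:
        outcomes_by_type[elm_type].append(success)

    rates = []
    for _ in range(n_repeats):
        # One randomly selected case from each category.
        draw = [rng.choice(outcomes) for outcomes in outcomes_by_type.values()]
        rates.append(sum(draw) / len(draw))
    return mean(rates), stdev(rates)


# Toy data (hypothetical): one heavily over-represented category.
toy_cases = (
    [("TYPE_A", True)] * 50
    + [("TYPE_B", False)] * 2
    + [("TYPE_C", True), ("TYPE_C", False)]
)

naive_rate = sum(success for _, success in toy_cases) / len(toy_cases)
strat_mean, strat_sd = stratified_success_rate(toy_cases)
print(f"naive: {naive_rate:.2f}, stratified: {strat_mean:.2f} +/- {strat_sd:.2f}")
```

In the toy data the naive rate is inflated by the 50 successes in one category, whereas the stratified estimate weights the three categories equally.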
Blinding was not required in the study since the test dataset was defined so that it does not overlap with any of the structures of complexes used to train AlphaFold2 parameters.
materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

- /www.bioinf.org.uk/software/swreg.html)
- Software Singularity (version V3.8.3, from https://github.com/apptainer/singularity/releases/tag/v3.8.3)
- Software ColabFold (version 1.3, from https://github.com/sokrypton/ColabFold, commit 5ddfd0bbadbffc5757ee1912107704aec3cd8c04, with correction for the random seed and display of the ipTM score) for the 42 non-redundant dataset
- Software ColabFold (version 1.5.2, from https://github.com/sokrypton/ColabFold, commit 3e99c44eec189ec27f6d120af851adb7ff6aa2a2)
- All the scores calculated for every generated model of the 42 non-redundant dataset are provided in Supplementary Table 3
- All the accessions, delimitations and scores calculated for the best model of each of the 923 cases of the ELM dataset are provided in Supplementary Table 5
- All the sequence alignments, the calculated models and the reference structures used are provided in: https://doi.org/10.5281/zenodo.7838023

Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Databases and datasets used in this study:
- Initial list of protein-peptide complexes retrieved from the PDB server database on April 1, 2022 (https://www.rcsb.org/)
- Full amino-acid sequences were retrieved from the UniProt database (https://www.uniprot.org/) using the UniProt IDs indicated in the PDB mmCIF files.
- The uniref30_2103 database (available at https://colabfold.mmseqs.com/) was used to generate the multiple sequence alignments, which are all provided in https://doi.org/10.5281/zenodo.7838023
- The ELM database (available at http://elm.eu.org/downloads.html, version July 3, 2023) was used to retrieve all the potential pairs of receptor/ligand complexes

Data availability:
- All the accession codes and the delimitations used are provided without any restrictions in Supplementary Table 1